This report explores Prosper Loan dataset that contain loan information for approximately 114,000 loan records.
Before exploring analysis, I would like to deal with any missing values.
Above chart is showing there are some missing values with ProsperRating..numeric, ProsperRating..Alpha, ProsperScore, and DebtToIncomeRatio, and EmploymentStatus.
First of all, in this section, I would like to perform some preliminary exploration of the dataset.
Original Prosper loan dataset contains approximately 114,000 observations and 81 variables. I have created a subset from original data so that I can only focus on already closed loans. Also, I have picked up only necessary variables. Thus, the dataset in this report includes 55,084 observations with 16 variables as follows.
## [1] 640 480 700 760 620 680 660 740 520 720 600 580 540 820 560 780 500
## [18] 840 860 0 800 460 880 440 420 360
## [1] 55084 30
Structure of the dataset is listed as follows:
## 'data.frame': 55084 obs. of 30 variables:
## $ LoanOriginalAmount : int 9425 3001 1000 4000 10000 3000 2000 4000 4000 10000 ...
## $ BorrowerRate : num 0.158 0.275 0.133 0.318 0.125 ...
## $ StatedMonthlyIncome : num 3083 2083 1667 5500 5833 ...
## $ DebtToIncomeRatio : num 0.17 0.06 0.27 0.49 0.12 0.09 0.39 0.11 0.26 0.11 ...
## $ MonthlyLoanPayment : num 330.4 123.3 33.8 173.7 334.5 ...
## $ ProsperScore : num 0 0 0 5 0 0 5 3 9 9 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 7 2 5 7 4 4 4 6 4 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 3 3 5 2 3 5 3 3 3 ...
## $ EmploymentStatus : Factor w/ 8 levels "Employed","Full-time",..: 8 3 2 5 3 2 1 1 2 1 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 1 1 1 2 1 ...
## $ ProsperRating..Alpha. : chr "NA" "NA" "NA" "HR" ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 260 451 1399 160 488 1655 1609 941 1348 ...
## $ ListingCategory..numeric. : int 0 0 0 13 0 0 15 1 1 20 ...
## $ Term : int 36 36 36 36 36 36 36 60 36 36 ...
## $ ProsperRating..numeric. : int NA NA NA 1 NA NA 3 2 6 6 ...
## $ EmploymentStatusDuration : num 2 0 19 121 0 36 3 25 10 44 ...
## $ CurrentCreditLines : num 5 0 2 10 0 4 7 7 16 4 ...
## $ OpenRevolvingAccounts : int 1 0 1 5 16 3 5 6 11 4 ...
## $ OpenRevolvingMonthlyPayment: num 24 0 40 300 346 15 245 95 294 105 ...
## $ TotalInquiries : num 3 1 5 1 17 8 0 3 8 0 ...
## $ CurrentDelinquencies : num 2 1 3 0 2 1 0 0 0 1 ...
## $ RevolvingCreditBalance : num 0 0 1220 9103 0 ...
## $ BankcardUtilization : num 0 0 0.32 0.97 0 0.08 0.84 0.3 0.09 0.13 ...
## $ AvailableBankcardCredit : num 1500 0 2580 178 0 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ CreditScoreRangeLower : num 640 480 640 700 760 620 680 660 700 740 ...
## $ LoanOriginalAmountRange : Factor w/ 20 levels "$0 - $999","$1,000 - $1,999",..: 9 3 1 4 10 3 2 4 4 10 ...
## $ StatedMonthlyIncomeRange : Factor w/ 20 levels "$0 - $999","$1,000 - $1,999",..: 3 2 1 5 5 3 2 3 6 3 ...
## $ IsDefaulted : chr "Completed" "Completed" "Completed" "Defaulted/Charge-off" ...
## $ MonthlyIncomePaymentDiff : num 0.893 0.941 0.98 0.968 0.943 ...
Summary of each variables are as follows:
## LoanOriginalAmount BorrowerRate StatedMonthlyIncome DebtToIncomeRatio
## Min. : 1000 Min. :0.0000 Min. : 0 Min. : 0.0000
## 1st Qu.: 2600 1st Qu.:0.1350 1st Qu.: 2811 1st Qu.: 0.1000
## Median : 4500 Median :0.1945 Median : 4167 Median : 0.1900
## Mean : 6262 Mean :0.2004 Mean : 5055 Mean : 0.2676
## 3rd Qu.: 8000 3rd Qu.:0.2699 3rd Qu.: 6250 3rd Qu.: 0.2900
## Max. :35000 Max. :0.4975 Max. :618548 Max. :10.0100
##
## MonthlyLoanPayment ProsperScore IncomeRange
## Min. : 0.00 Min. : 0.000 $25,000-49,999:16343
## 1st Qu.: 98.29 1st Qu.: 0.000 $50,000-74,999:12789
## Median : 172.60 Median : 0.000 Not displayed : 7737
## Mean : 223.73 Mean : 2.958 $75,000-99,999: 6442
## 3rd Qu.: 300.47 3rd Qu.: 6.000 $100,000+ : 6064
## Max. :2251.51 Max. :11.000 $1-24,999 : 4571
## (Other) : 1138
## LoanStatus EmploymentStatus IsBorrowerHomeowner
## Completed :38074 Full-time :24957 False:29199
## Chargedoff :11992 Employed :16491 True :25885
## Defaulted : 5018 Not available: 5346
## Cancelled : 0 Self-employed: 2926
## Current : 0 Part-time : 1056
## FinalPaymentInProgress: 0 (Other) : 2056
## (Other) : 0 NA's : 2252
## ProsperRating..Alpha. LoanOriginationDate
## Length:55084 2007-01-17 00:00:00: 137
## Class :character 2011-08-31 00:00:00: 133
## Mode :character 2008-04-30 00:00:00: 128
## 2008-05-28 00:00:00: 126
## 2011-09-30 00:00:00: 121
## 2011-11-30 00:00:00: 120
## (Other) :54319
## ListingCategory..numeric. Term ProsperRating..numeric.
## Min. : 0.000 Min. :12.00 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:36.00 1st Qu.:2.000
## Median : 1.000 Median :36.00 Median :3.000
## Mean : 2.408 Mean :36.94 Mean :3.663
## 3rd Qu.: 3.000 3rd Qu.:36.00 3rd Qu.:5.000
## Max. :20.000 Max. :60.00 Max. :7.000
## NA's :29079
## EmploymentStatusDuration CurrentCreditLines OpenRevolvingAccounts
## Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 9.00 1st Qu.: 4.000 1st Qu.: 3.000
## Median : 41.00 Median : 8.000 Median : 5.000
## Mean : 69.72 Mean : 8.248 Mean : 6.078
## 3rd Qu.: 99.00 3rd Qu.:12.000 3rd Qu.: 8.000
## Max. :755.00 Max. :59.000 Max. :51.000
##
## OpenRevolvingMonthlyPayment TotalInquiries CurrentDelinquencies
## Min. : 0.0 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 60.0 1st Qu.: 2.000 1st Qu.: 0.0000
## Median : 183.0 Median : 5.000 Median : 0.0000
## Mean : 325.3 Mean : 7.016 Mean : 0.8948
## 3rd Qu.: 418.0 3rd Qu.: 9.000 3rd Qu.: 1.0000
## Max. :14985.0 Max. :379.000 Max. :83.0000
##
## RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit
## Min. : 0 Min. :0.0000 Min. : 0
## 1st Qu.: 429 1st Qu.:0.0400 1st Qu.: 92
## Median : 4368 Median :0.4600 Median : 2019
## Mean : 13525 Mean :0.4573 Mean : 9360
## 3rd Qu.: 13750 3rd Qu.:0.8100 3rd Qu.: 9791
## Max. :1435667 Max. :5.9500 Max. :646285
##
## IncomeVerifiable CreditScoreRangeLower LoanOriginalAmountRange
## False: 4300 Min. : 0.0 $1,000 - $1,999: 7933
## True :50784 1st Qu.:620.0 $2,000 - $2,999: 7118
## Median :680.0 $3,000 - $3,999: 6858
## Mean :664.6 $0 - $999 : 6468
## 3rd Qu.:720.0 $4,000 - $4,999: 5612
## Max. :880.0 $6,000 - $6,999: 3959
## (Other) :17136
## StatedMonthlyIncomeRange IsDefaulted MonthlyIncomePaymentDiff
## $2,000 - $2,999: 9434 Length:55084 Min. :-12570.730
## $1,000 - $1,999: 9379 Class :character 1st Qu.: 0.926
## $3,000 - $3,999: 8103 Mode :character Median : 0.958
## $4,000 - $4,999: 6601 Mean : -15.523
## $0 - $999 : 6350 3rd Qu.: 0.977
## $5,000 - $5,999: 4594 Max. : 1.000
## (Other) :10623
Now, I would like to look into the dataset. Firstly, I would like to explore some of the univariate plots.
First histogram is showing right skewed distribution. I have transformed the long tail data to better understand the distriution. The transformed loan amount distribution appears to have unimodal with peaks between 5,000 - 7,000.
I would like to look at other numerical variables as well.
Histogram by Borrwer rate is showing slightly right skewed unimodal distribution with its peak around 0.15.
Histogram by loan term only shows 12, 36 and 60 months. Why are they so cloear cut? According to Prosper website[1], they seem to be offering only fixed term with either 3 or 5 years.
[1] https://www.prosper.com/ Prosper website
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2811 4167 5055 6250 618548
Next, I would like to look across categorical variables.
Firstly let’s look at the distribution by Income Range.
I was expecting to see more right skewed distribution, in ohter words, tendecy like borrower with lower income tend to borrow more. However, the plot shows unimodal distribution with peak around $25,000 and $50,000.
## [1] 0.8247404
This plot shows distribution by Employment status. 82.4 % of the borrowers are employed in some form, i.e. either ‘Employed’,‘Full-time’, ‘Part-time’, or ‘Self-employed’.
## [1] 0.4699187
Above plot showing distribution by whether or not the borrower is home owner. 46.9% of the borrowers are home owners.
Next, let’s have alook at distribution by risk indicators.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.958 6.000 11.000
Above plot by Prosper score is showing slightly left skewed unimodal distribution with its peak around 8 with median of 6 and most of them are between 5 and 8.
This plot shows distribution by Prosper rating. Borrowers seem to be distributed almost equally across each ratings, except ‘D’ has the prominently high density.
Also listed the summary for each ratings as below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2600 4500 6262 8000 35000
## loandata_completed$ProsperRating..Alpha.: A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3500 6000 8041 11000 35000
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 6500 8178 12000 25000
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 8559 12500 35000
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 6000 7547 12000 25000
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 5000 6033 8000 15000
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 4000 4300 5500 15900
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4000 3397 4000 16800
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: NA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4500 6160 7920 25000
Finally, I would like to see what are the proportion of defaulted/charged-off loans.
The first plot shows the proportion of completed loans, defaulted loans, and charged-off loans. In the second plot, I just combined defaulted loans and charge-off loans togerther, to compare with the completed loans.
## [1] 0.6911989
Almost 70 % of closed loans are completed without being defaulted/charge-off.
Original Prosperloan dataset consists of 81 variables, with approximately 114,000 observations. I have created a subset of 55,084 observations that have closed status and also only use 16 variables.
The main features of interest are Categorical factor variables listed as below:
Other variables such as BorrowerRate, StatedMonthlyIncome, DebtToIncomeRatio, MonthlyLoanPayment, and Term, likely contribute to further analysis.
Firstly I have plotted a correlation matrix as below.
Now, I would like to look into the relation among individual variables more closely.
Firstly, I would like to see the relation between Prosper Score and Monthly Income, and also Prosper rating and monthly income, that are showing the strongest correlation in above chart.
Next, I would like to see how borrowers rate differs across Prosper rating and Prosper score.
Next, I would like to see the correlation between Borrower rate and Monthly Income that also showing quite strong correlation of -0.0691 in the first correlation matrix.
The lm line showing the negative trend between Stated Monthly Income and Borrower rate. The hihger the monthly income, the lower the rate.
Next, I would like to see how the original loan amount varies by Prosper ratings.
It becomes more obvious when comparing the total in number and in amount for each ratings.
Next, I would like to explore what factors could affect whether or not a loan is to be defaulted?
Firstly, I would like to look this by loan amount.
Next, I would like to look into the distribution by monthly payment amount.
## loandata_completed$IsDefaulted: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 94.15 171.10 218.78 297.00 2251.51
## --------------------------------------------------------
## loandata_completed$IsDefaulted: Defaulted/Charge-off
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 107.4 173.7 234.8 307.2 1552.8
What can we see if we look this from the borrower rate?
Above plot compares the distribution by borrower rate.
As we have seen earlier, the borrower rate seems to be determined based on risk factors, such as Prosper rating and Prosper scores.
Let’s have a look at the distribution by rating and score to see how it differs between completed loans and defaullted loans.
## loandata_completed$IsDefaulted: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.000 3.376 7.000 11.000
## --------------------------------------------------------
## loandata_completed$IsDefaulted: Defaulted/Charge-off
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.023 4.000 11.000
** The proportion of AA and A seem to be much smaller for Defaulted/Charge-off loans than Completed loans.
** Lower ratings such as E and HR have larger proportion with defaulted loans.
** Defaulted/Charge-off loans has its peak around 5-6 whereas Completed loans has around 8.
** Summary shows 3rd quatile of Defaulted/charge-off loans is 7, whereas that of Completed loans is 8.
How did the feature(s) of interest vary with other features in the dataset?
I would like to look in the separated plots as below to see how the loan amount differs by Prosper rating.
Seemingly, borrowers with higher rating tend to borrow more amount with lower rate than lower graded borrowers.
Do the borrower with better rating or score repay more?
## loandata_completed$IsDefaulted: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2917 4417 5324 6583 618548
## --------------------------------------------------------
## loandata_completed$IsDefaulted: Defaulted/Charge-off
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2500 3750 4451 5417 208333
Finally I would like to see how the defaulted loans are distributed across Prosper rating and Prosper score on below scatter plot.
It seems Prosper rating and Prosper score have quite strong relation though it is difficult to say if this relation is strengthening each other. However, these 2 variables seems to be correaltes also with borrower rate.
It is not surprising to see there is strong correlation between risk indicators and borrower rate.
N/A
For defaulted loans, the borrower rate range is distributed more in higher rates, comparing to the completed loans. This may suggest that the higher the loan rate, the more difficult for borrowers to complete the loan.
Ratings and Scores seem to have quite strong correlation with borrower rate. These risk indicators semms to be used as a basis when determining borrower rate.
So, are these risk indicator effective to reduce the risk of default?
Seemingly, defaulted/charge-off loans are distributed more in lower ratings such as D, E, and HR. However, defaulted loans still can be seen even with higher ratings like AA or A.
After completed the Udacity nanodegree course, revisited this dataset with attempt to establish a logistic regression model.
### prepare a sample dataset
set.seed(20022012)
### add new variable Defaulted by converting IsDefaulted to boolean
loandata_completed$Defaulted <- ifelse(
loandata_completed$LoanStatus == 'Defaulted' |
loandata_completed$LoanStatus == 'Chargedoff', TRUE,
FALSE)
loandata_completed_train<- subset(
loandata_completed,
select = c(LoanOriginalAmountRange,
BorrowerRate,
StatedMonthlyIncomeRange,
DebtToIncomeRatio,
MonthlyLoanPayment,
ProsperScore,
ProsperRating..Alpha.,
CurrentCreditLines,
EmploymentStatusDuration,
OpenRevolvingAccounts,
OpenRevolvingMonthlyPayment,
BankcardUtilization,
RevolvingCreditBalance,
AvailableBankcardCredit,
TotalInquiries,
CurrentDelinquencies,
Term,
IncomeVerifiable,
CreditScoreRangeLower,
Defaulted
))
###split data in to train and test dataset
library(caTools)
set.seed(144)
spl = sample.split(loandata_completed_train$Defaulted, 0.7)
train_set = subset(loandata_completed_train, spl == TRUE)
test_set = subset(loandata_completed_train, spl == FALSE)
### fitting
#model <- glm(Defaulted~.,family=binomial(link='logit'),data=loandata_completed_train)
model <- glm(Defaulted~.,family=binomial(link='logit'),data=train_set)
##
## Call:
## glm(formula = Defaulted ~ ., family = binomial(link = "logit"),
## data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7741 -0.8352 -0.5553 1.0165 2.7604
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -3.602e+00 2.001e-01 -18.004
## LoanOriginalAmountRange$1,000 - $1,999 3.077e-01 5.094e-02 6.040
## LoanOriginalAmountRange$2,000 - $2,999 3.577e-01 5.385e-02 6.643
## LoanOriginalAmountRange$3,000 - $3,999 5.096e-01 6.013e-02 8.474
## LoanOriginalAmountRange$4,000 - $4,999 4.807e-01 6.531e-02 7.360
## LoanOriginalAmountRange$5,000 - $5,999 2.813e-01 8.138e-02 3.457
## LoanOriginalAmountRange$6,000 - $6,999 4.540e-01 8.260e-02 5.496
## LoanOriginalAmountRange$7,000 - $7,999 3.947e-01 1.036e-01 3.810
## LoanOriginalAmountRange$8,000 - $8,999 3.788e-01 1.128e-01 3.358
## LoanOriginalAmountRange$9,000 - $9,999 3.155e-01 1.046e-01 3.016
## LoanOriginalAmountRange$10,000 - $10,999 6.821e-02 1.650e-01 0.413
## LoanOriginalAmountRange$11,000 - $11,999 1.200e-01 1.428e-01 0.840
## LoanOriginalAmountRange$12,000 - $12,999 -1.739e-02 1.739e-01 -0.100
## LoanOriginalAmountRange$13,000 - $13,999 1.367e-01 1.884e-01 0.726
## LoanOriginalAmountRange$14,000 - $14,999 1.036e-01 1.422e-01 0.728
## LoanOriginalAmountRange$15,000 - $15,999 8.235e-02 2.461e-01 0.335
## LoanOriginalAmountRange$16,000 - $16,999 1.997e-01 2.443e-01 0.817
## LoanOriginalAmountRange$17,000 - $17,999 2.172e-02 2.469e-01 0.088
## LoanOriginalAmountRange$18,000 - $19,999 -4.086e-01 3.299e-01 -1.238
## LoanOriginalAmountRange$20,000 or more -4.410e-02 2.035e-01 -0.217
## BorrowerRate 7.057e+00 2.506e-01 28.157
## StatedMonthlyIncomeRange$1,000 - $1,999 -3.683e-02 4.512e-02 -0.816
## StatedMonthlyIncomeRange$2,000 - $2,999 -3.522e-01 4.671e-02 -7.540
## StatedMonthlyIncomeRange$3,000 - $3,999 -5.268e-01 4.948e-02 -10.646
## StatedMonthlyIncomeRange$4,000 - $4,999 -5.639e-01 5.268e-02 -10.705
## StatedMonthlyIncomeRange$5,000 - $5,999 -6.610e-01 5.941e-02 -11.126
## StatedMonthlyIncomeRange$6,000 - $6,999 -8.260e-01 6.825e-02 -12.103
## StatedMonthlyIncomeRange$7,000 - $7,999 -9.607e-01 7.916e-02 -12.136
## StatedMonthlyIncomeRange$8,000 - $8,999 -9.881e-01 1.008e-01 -9.800
## StatedMonthlyIncomeRange$9,000 - $9,999 -9.920e-01 9.963e-02 -9.957
## StatedMonthlyIncomeRange$10,000 - $10,999 -1.099e+00 1.470e-01 -7.475
## StatedMonthlyIncomeRange$11,000 - $11,999 -9.183e-01 1.351e-01 -6.799
## StatedMonthlyIncomeRange$12,000 - $12,999 -9.698e-01 1.805e-01 -5.374
## StatedMonthlyIncomeRange$13,000 - $13,999 -1.061e+00 2.275e-01 -4.661
## StatedMonthlyIncomeRange$14,000 - $14,999 -1.151e+00 2.156e-01 -5.340
## StatedMonthlyIncomeRange$15,000 - $15,999 -7.864e-01 2.043e-01 -3.849
## StatedMonthlyIncomeRange$16,000 - $16,999 -1.359e+00 3.585e-01 -3.792
## StatedMonthlyIncomeRange$17,000 - $17,999 -1.372e+00 4.047e-01 -3.390
## StatedMonthlyIncomeRange$18,000 - $19,999 -7.237e-01 3.992e-01 -1.813
## StatedMonthlyIncomeRange$20,000 or more -1.484e+00 1.591e-01 -9.326
## DebtToIncomeRatio 3.028e-02 1.687e-02 1.795
## MonthlyLoanPayment 2.220e-03 2.413e-04 9.199
## ProsperScore 3.452e-02 1.247e-02 2.767
## ProsperRating..Alpha.AA -5.345e-01 1.555e-01 -3.437
## ProsperRating..Alpha.B 2.085e-02 8.785e-02 0.237
## ProsperRating..Alpha.C -8.297e-02 8.819e-02 -0.941
## ProsperRating..Alpha.D -8.446e-02 8.882e-02 -0.951
## ProsperRating..Alpha.E -6.736e-02 1.052e-01 -0.640
## ProsperRating..Alpha.HR 1.545e-01 1.125e-01 1.373
## ProsperRating..Alpha.NA 8.383e-01 1.299e-01 6.455
## CurrentCreditLines 1.164e-02 3.551e-03 3.279
## EmploymentStatusDuration 4.705e-04 1.582e-04 2.974
## OpenRevolvingAccounts -3.930e-02 4.873e-03 -8.066
## OpenRevolvingMonthlyPayment 4.364e-04 5.112e-05 8.538
## BankcardUtilization -1.869e-01 3.982e-02 -4.693
## RevolvingCreditBalance -1.092e-06 6.418e-07 -1.702
## AvailableBankcardCredit -3.054e-06 9.659e-07 -3.162
## TotalInquiries 3.726e-02 1.758e-03 21.192
## CurrentDelinquencies 9.294e-02 5.599e-03 16.598
## Term 3.077e-02 2.384e-03 12.908
## IncomeVerifiableTrue -1.491e-01 4.678e-02 -3.187
## CreditScoreRangeLower -1.363e-03 1.361e-04 -10.009
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## LoanOriginalAmountRange$1,000 - $1,999 1.54e-09 ***
## LoanOriginalAmountRange$2,000 - $2,999 3.08e-11 ***
## LoanOriginalAmountRange$3,000 - $3,999 < 2e-16 ***
## LoanOriginalAmountRange$4,000 - $4,999 1.84e-13 ***
## LoanOriginalAmountRange$5,000 - $5,999 0.000547 ***
## LoanOriginalAmountRange$6,000 - $6,999 3.88e-08 ***
## LoanOriginalAmountRange$7,000 - $7,999 0.000139 ***
## LoanOriginalAmountRange$8,000 - $8,999 0.000784 ***
## LoanOriginalAmountRange$9,000 - $9,999 0.002561 **
## LoanOriginalAmountRange$10,000 - $10,999 0.679366
## LoanOriginalAmountRange$11,000 - $11,999 0.400664
## LoanOriginalAmountRange$12,000 - $12,999 0.920344
## LoanOriginalAmountRange$13,000 - $13,999 0.467923
## LoanOriginalAmountRange$14,000 - $14,999 0.466356
## LoanOriginalAmountRange$15,000 - $15,999 0.737929
## LoanOriginalAmountRange$16,000 - $16,999 0.413646
## LoanOriginalAmountRange$17,000 - $17,999 0.929896
## LoanOriginalAmountRange$18,000 - $19,999 0.215537
## LoanOriginalAmountRange$20,000 or more 0.828435
## BorrowerRate < 2e-16 ***
## StatedMonthlyIncomeRange$1,000 - $1,999 0.414326
## StatedMonthlyIncomeRange$2,000 - $2,999 4.72e-14 ***
## StatedMonthlyIncomeRange$3,000 - $3,999 < 2e-16 ***
## StatedMonthlyIncomeRange$4,000 - $4,999 < 2e-16 ***
## StatedMonthlyIncomeRange$5,000 - $5,999 < 2e-16 ***
## StatedMonthlyIncomeRange$6,000 - $6,999 < 2e-16 ***
## StatedMonthlyIncomeRange$7,000 - $7,999 < 2e-16 ***
## StatedMonthlyIncomeRange$8,000 - $8,999 < 2e-16 ***
## StatedMonthlyIncomeRange$9,000 - $9,999 < 2e-16 ***
## StatedMonthlyIncomeRange$10,000 - $10,999 7.74e-14 ***
## StatedMonthlyIncomeRange$11,000 - $11,999 1.06e-11 ***
## StatedMonthlyIncomeRange$12,000 - $12,999 7.70e-08 ***
## StatedMonthlyIncomeRange$13,000 - $13,999 3.14e-06 ***
## StatedMonthlyIncomeRange$14,000 - $14,999 9.30e-08 ***
## StatedMonthlyIncomeRange$15,000 - $15,999 0.000118 ***
## StatedMonthlyIncomeRange$16,000 - $16,999 0.000150 ***
## StatedMonthlyIncomeRange$17,000 - $17,999 0.000700 ***
## StatedMonthlyIncomeRange$18,000 - $19,999 0.069846 .
## StatedMonthlyIncomeRange$20,000 or more < 2e-16 ***
## DebtToIncomeRatio 0.072653 .
## MonthlyLoanPayment < 2e-16 ***
## ProsperScore 0.005651 **
## ProsperRating..Alpha.AA 0.000589 ***
## ProsperRating..Alpha.B 0.812395
## ProsperRating..Alpha.C 0.346833
## ProsperRating..Alpha.D 0.341676
## ProsperRating..Alpha.E 0.521984
## ProsperRating..Alpha.HR 0.169764
## ProsperRating..Alpha.NA 1.08e-10 ***
## CurrentCreditLines 0.001041 **
## EmploymentStatusDuration 0.002942 **
## OpenRevolvingAccounts 7.28e-16 ***
## OpenRevolvingMonthlyPayment < 2e-16 ***
## BankcardUtilization 2.69e-06 ***
## RevolvingCreditBalance 0.088737 .
## AvailableBankcardCredit 0.001568 **
## TotalInquiries < 2e-16 ***
## CurrentDelinquencies < 2e-16 ***
## Term < 2e-16 ***
## IncomeVerifiableTrue 0.001438 **
## CreditScoreRangeLower < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 47669 on 38558 degrees of freedom
## Residual deviance: 41058 on 38497 degrees of freedom
## AIC: 41182
##
## Number of Fisher Scoring iterations: 5
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Defaulted
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 38558 47669
## LoanOriginalAmountRange 19 124.87 38539 47545 < 2.2e-16 ***
## BorrowerRate 1 2867.17 38538 44677 < 2.2e-16 ***
## StatedMonthlyIncomeRange 19 645.42 38519 44032 < 2.2e-16 ***
## DebtToIncomeRatio 1 18.69 38518 44013 1.536e-05 ***
## MonthlyLoanPayment 1 11.42 38517 44002 0.0007246 ***
## ProsperScore 1 1274.22 38516 42728 < 2.2e-16 ***
## ProsperRating..Alpha. 7 396.33 38509 42331 < 2.2e-16 ***
## CurrentCreditLines 1 25.02 38508 42306 5.680e-07 ***
## EmploymentStatusDuration 1 4.50 38507 42302 0.0339317 *
## OpenRevolvingAccounts 1 13.97 38506 42288 0.0001853 ***
## OpenRevolvingMonthlyPayment 1 46.92 38505 42241 7.382e-12 ***
## BankcardUtilization 1 52.30 38504 42189 4.752e-13 ***
## RevolvingCreditBalance 1 2.54 38503 42186 0.1108529
## AvailableBankcardCredit 1 24.64 38502 42161 6.908e-07 ***
## TotalInquiries 1 515.54 38501 41646 < 2.2e-16 ***
## CurrentDelinquencies 1 315.59 38500 41330 < 2.2e-16 ***
## Term 1 170.04 38499 41160 < 2.2e-16 ***
## IncomeVerifiable 1 5.99 38498 41154 0.0143825 *
## CreditScoreRangeLower 1 96.53 38497 41058 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results: * BorrowerRate and StatedMonthlyIncome have the lowest p-value of 2e-16. This suggests strong association of between these two variables and the probability of the loan to go default. * EmploymentStatusDuration, IncomeVerifiable, and RevolvingCreditBalance have less correlation.
Memo: * When running anova(), first attempt caused perfect separation warning and LoanOriginalAmount seems to be responsible for this warning. * Applied binning on LoanOriginalAmount, but still throwing the warning. StatedMonthlyIncome also seems to be causing this warning. * After binning both LoanOriginalAmount and StatedMonthlyIncome, the warning is no longer thrown.
### predict
fitted.results <- predict(model,newdata=test_set,type='response')
### examine accuracy
threshold = 0.5
fitted.results <- ifelse(fitted.results > threshold,1,0)
misClasificError <- mean(fitted.results != test_set$Defaulted)
print(paste("Misclacific err::",misClasificError))
## [1] "Misclacific err:: 0.275824508320726"
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.724175491679274"
# compute McFadden R2
print("R2: ")
## [1] "R2: "
pR2(model)
## llh llhNull G2 McFadden r2ML
## -2.052888e+04 -2.383474e+04 6.611714e+03 1.386991e-01 1.575745e-01
## r2CU
## 2.220813e-01
### Confusion Matrix
confusion_matrix <- table(test_set$Defaulted, as.numeric(fitted.results >= threshold))
cat("\n")
print("Confusion Matrix: ")
## [1] "Confusion Matrix: "
confusion_matrix
##
## 0 1
## FALSE 10427 995
## TRUE 3563 1540
tn <- confusion_matrix[1]
fn <- confusion_matrix[2]
fp <- confusion_matrix[3]
tp <- confusion_matrix[4]
cat("\n")
print(paste('True Negative: ',tn))
## [1] "True Negative: 10427"
print(paste('False Negative: ',fn))
## [1] "False Negative: 3563"
print(paste('False Positive: ',fp))
## [1] "False Positive: 995"
print(paste('True Positive: ',tp))
## [1] "True Positive: 1540"
print(paste('Recall: ', tp/(tp + fn)))
## [1] "Recall: 0.301783264746228"
print(paste('Precision: ',tp/(tp + fp)))
## [1] "Precision: 0.607495069033531"
# accuracy based on above confusion matrix
print(paste('Accuracy:', (tn + tp)/sum(confusion_matrix)))
## [1] "Accuracy: 0.724175491679274"
We got accuracy of 72.4% Let’s examine the accuracy by conducting coross validation.
predictors_vector <- c("BorrowerRate", "DebtToIncomeRatio","LoanOriginalAmountRange","StatedMonthlyIncomeRange","MonthlyLoanPayment",
"ProsperScore","ProsperRating..Alpha.",
"CurrentCreditLines","EmploymentStatusDuration","Term",
"BankcardUtilization","RevolvingCreditBalance","AvailableBankcardCredit","TotalInquiries","CurrentDelinquencies"
)
predictors_vector_sum <- paste(predictors_vector, collapse="+")
logReg_model_fmla <- as.formula(paste("Defaulted ~ ", predictors_vector_sum))
fitControl = trainControl( method = "cv", number = 10, p = 0.70 )
### Convert `Defaulted` column from boolean to factor
#train_set$Defaulted <-as.integer(as.logical(train_set$Defaulted))
train_set$Defaulted <-as.factor(as.logical(train_set$Defaulted))
logReg_model_cv <- train(logReg_model_fmla, data=train_set, method = "bayesglm", trControl = fitControl)
logReg_model_cv$resample$Accuracy
## [1] 0.7160270 0.7203632 0.7321058 0.7249157 0.7282158 0.7266598 0.7282158
## [8] 0.7282158 0.7334025 0.7260700
mean(logReg_model_cv$resample$Accuracy)
## [1] 0.7264191
Above cross validation results seem to match the predicted accuracy of 72.4%.
p <- predict(model, newdata=test_set, type="response")
pr <- prediction(p, test_set$Defaulted)
# Performance function
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
# Plot ROC curve
plot(prf,colorize=TRUE,print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
grid()
# AUC
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
print(paste("AUC:", auc))
## [1] "AUC: 0.734155867332897"
# Recall-Precision curve
prf <- performance(pr, measure = "prec", x.measure = "rec")
plot (prf);
# F1 score
prf <- performance(pr, "f")
plot (prf);
We have used threshold of 0.5 in previous analysis and we succeeded in identifying 1,540 true positives, but still missing 3,563 false positives.
Let’s see if changing threshold improve the prediction ability.
fitted.results <- predict(model,newdata=test_set,type='response')
thresholds <- seq(0.1,1.0,by=0.05)
for(threshold in thresholds){
results.over.threshold <- ifelse(fitted.results > threshold,1,0)
misClasificError <- mean(results.over.threshold != test_set$Defaulted)
mean(results.over.threshold)
confusion_matrix <- table(test_set$Defaulted, as.numeric(results.over.threshold >= threshold))
print(paste('Threshold: ',threshold))
print(confusion_matrix)
print(misClasificError)
print(paste('Accuracy',1-misClasificError))
tn <- confusion_matrix[1]
fn <- confusion_matrix[2]
fp <- confusion_matrix[3]
tp <- confusion_matrix[4]
print(paste('False Negative: ',fn))
print(paste('True Positive: ',tp))
recall <- tp/(tp + fn)
precision <- tp/(tp + fp)
f1 <- 2* precision * recall /(precision + recall)
print(paste('Recall: ', recall))
print(paste('Precision: ',precision))
print(paste('F1: ',f1))
cat("\n")
}
## [1] "Threshold: 0.1"
##
## 0 1
## FALSE 1786 9636
## TRUE 117 4986
## [1] 0.5901967
## [1] "Accuracy 0.409803328290469"
## [1] "False Negative: 117"
## [1] "True Positive: 4986"
## [1] "Recall: 0.977072310405644"
## [1] "Precision: 0.340993024210094"
## [1] "F1: 0.505551330798479"
##
## [1] "Threshold: 0.15"
##
## 0 1
## FALSE 3214 8208
## TRUE 304 4799
## [1] 0.5150983
## [1] "Accuracy 0.484901664145235"
## [1] "False Negative: 304"
## [1] "True Positive: 4799"
## [1] "Recall: 0.940427199686459"
## [1] "Precision: 0.368955177981087"
## [1] "F1: 0.529983434566538"
##
## [1] "Threshold: 0.2"
##
## 0 1
## FALSE 4760 6662
## TRUE 635 4468
## [1] 0.4415734
## [1] "Accuracy 0.558426626323752"
## [1] "False Negative: 635"
## [1] "True Positive: 4468"
## [1] "Recall: 0.875563394081913"
## [1] "Precision: 0.401437556154537"
## [1] "F1: 0.550483582825109"
##
## [1] "Threshold: 0.25"
##
## 0 1
## FALSE 6163 5259
## TRUE 1089 4014
## [1] 0.3841452
## [1] "Accuracy 0.615854765506808"
## [1] "False Negative: 1089"
## [1] "True Positive: 4014"
## [1] "Recall: 0.786596119929453"
## [1] "Precision: 0.432869621481721"
## [1] "F1: 0.558430717863105"
##
## [1] "Threshold: 0.3"
##
## 0 1
## FALSE 7402 4020
## TRUE 1584 3519
## [1] 0.3391225
## [1] "Accuracy 0.660877458396369"
## [1] "False Negative: 1584"
## [1] "True Positive: 3519"
## [1] "Recall: 0.689594356261023"
## [1] "Precision: 0.466772781536013"
## [1] "F1: 0.55671570953963"
##
## [1] "Threshold: 0.35"
##
## 0 1
## FALSE 8459 2963
## TRUE 2117 2986
## [1] 0.307413
## [1] "Accuracy 0.692586989409985"
## [1] "False Negative: 2117"
## [1] "True Positive: 2986"
## [1] "Recall: 0.5851459925534"
## [1] "Precision: 0.501933097999664"
## [1] "F1: 0.540354686934491"
##
## [1] "Threshold: 0.4"
##
## 0 1
## FALSE 9300 2122
## TRUE 2625 2478
## [1] 0.2872617
## [1] "Accuracy 0.712738275340393"
## [1] "False Negative: 2625"
## [1] "True Positive: 2478"
## [1] "Recall: 0.48559670781893"
## [1] "Precision: 0.538695652173913"
## [1] "F1: 0.510769864990209"
##
## [1] "Threshold: 0.45"
##
## 0 1
## FALSE 9958 1464
## TRUE 3125 1978
## [1] 0.2777005
## [1] "Accuracy 0.722299546142209"
## [1] "False Negative: 3125"
## [1] "True Positive: 1978"
## [1] "Recall: 0.387615128355869"
## [1] "Precision: 0.5746658919233"
## [1] "F1: 0.46296079578701"
##
## [1] "Threshold: 0.5"
##
## 0 1
## FALSE 10427 995
## TRUE 3563 1540
## [1] 0.2758245
## [1] "Accuracy 0.724175491679274"
## [1] "False Negative: 3563"
## [1] "True Positive: 1540"
## [1] "Recall: 0.301783264746228"
## [1] "Precision: 0.607495069033531"
## [1] "F1: 0.403246923278345"
##
## [1] "Threshold: 0.55"
##
## 0 1
## FALSE 10720 702
## TRUE 3933 1170
## [1] 0.2804841
## [1] "Accuracy 0.719515885022693"
## [1] "False Negative: 3933"
## [1] "True Positive: 1170"
## [1] "Recall: 0.229276895943563"
## [1] "Precision: 0.625"
## [1] "F1: 0.335483870967742"
##
## [1] "Threshold: 0.6"
##
## 0 1
## FALSE 10958 464
## TRUE 4171 932
## [1] 0.2804841
## [1] "Accuracy 0.719515885022693"
## [1] "False Negative: 4171"
## [1] "True Positive: 932"
## [1] "Recall: 0.182637664119146"
## [1] "Precision: 0.667621776504298"
## [1] "F1: 0.286813355900908"
##
## [1] "Threshold: 0.65"
##
## 0 1
## FALSE 11134 288
## TRUE 4405 698
## [1] 0.2839939
## [1] "Accuracy 0.716006051437216"
## [1] "False Negative: 4405"
## [1] "True Positive: 698"
## [1] "Recall: 0.136782284930433"
## [1] "Precision: 0.707910750507099"
## [1] "F1: 0.229265889308589"
##
## [1] "Threshold: 0.7"
##
## 0 1
## FALSE 11256 166
## TRUE 4622 481
## [1] 0.2897428
## [1] "Accuracy 0.710257186081694"
## [1] "False Negative: 4622"
## [1] "True Positive: 481"
## [1] "Recall: 0.0942582794434646"
## [1] "Precision: 0.743431221020093"
## [1] "F1: 0.167304347826087"
##
## [1] "Threshold: 0.75"
##
## 0 1
## FALSE 11327 95
## TRUE 4760 343
## [1] 0.2937973
## [1] "Accuracy 0.706202723146747"
## [1] "False Negative: 4760"
## [1] "True Positive: 343"
## [1] "Recall: 0.0672153635116598"
## [1] "Precision: 0.78310502283105"
## [1] "F1: 0.1238043674427"
##
## [1] "Threshold: 0.8"
##
## 0 1
## FALSE 11366 56
## TRUE 4908 195
## [1] 0.3003933
## [1] "Accuracy 0.699606656580938"
## [1] "False Negative: 4908"
## [1] "True Positive: 195"
## [1] "Recall: 0.0382128159905938"
## [1] "Precision: 0.776892430278884"
## [1] "F1: 0.0728427344041838"
##
## [1] "Threshold: 0.85"
##
## 0 1
## FALSE 11390 32
## TRUE 4991 112
## [1] 0.3039637
## [1] "Accuracy 0.696036308623298"
## [1] "False Negative: 4991"
## [1] "True Positive: 112"
## [1] "Recall: 0.0219478737997256"
## [1] "Precision: 0.777777777777778"
## [1] "F1: 0.0426910615589861"
##
## [1] "Threshold: 0.9"
##
## 0 1
## FALSE 11404 18
## TRUE 5047 56
## [1] 0.3065053
## [1] "Accuracy 0.693494704992436"
## [1] "False Negative: 5047"
## [1] "True Positive: 56"
## [1] "Recall: 0.0109739368998628"
## [1] "Precision: 0.756756756756757"
## [1] "F1: 0.0216341510527332"
##
## [1] "Threshold: 0.95"
##
## 0 1
## FALSE 11416 6
## TRUE 5086 17
## [1] 0.3081392
## [1] "Accuracy 0.691860816944024"
## [1] "False Negative: 5086"
## [1] "True Positive: 17"
## [1] "Recall: 0.00333137370174407"
## [1] "Precision: 0.739130434782609"
## [1] "F1: 0.00663285212641436"
##
## [1] "Threshold: 1"
##
## 0
## FALSE 11422
## TRUE 5103
## [1] 0.3088048
## [1] "Accuracy 0.691195158850227"
## [1] "False Negative: 5103"
## [1] "True Positive: NA"
## [1] "Recall: NA"
## [1] "Precision: NA"
## [1] "F1: NA"
Threshold of 0.4 seems to score most balanced result, we have F1 score of 0.51 that is closest to 0.5, precision higher than 0.5, and also accuracy higher than 71%.